Abstract Recently, a number of modeling techniques have been developed for data mining
and machine learning in relational and network domains where the instances are not
independent and identically distributed (i.i.d.). These methods specifically exploit the statistical
dependencies among instances in order to improve classification accuracy. However, there
has been little focus on how these same dependencies affect our ability to draw accurate
conclusions about the performance of the models. More specifically, the complex link structure
and attribute dependencies in relational data violate the assumptions of many conventional
statistical tests and make it difficult to use these tests to assess the models in an unbiased
manner. In this work, we examine the task of within-network classification and the question
of whether two algorithms will learn models that will result in significantly different levels
of performance. We show that the commonly used form of evaluation (paired t-test on
overlapping network samples) can result in an unacceptable level of Type I error. Furthermore,
we show that Type I error increases as (1) the correlation among instances increases and (2)
the size of the evaluation set increases (i.e., the proportion of labeled nodes in the network
decreases). We propose a method for network cross-validation that, combined with paired
t-tests, produces more acceptable levels of Type I error while still providing reasonable levels
of statistical power (i.e., 1 − Type II error).
1 Introduction

The  seminal  work  of  Dietterich  [5]  focused  on  enumerating  the  types  of  statistical  questions 
that  analysts  could  ask  of  the  models  and  algorithms  that  they  develop  and/or  learn.  His  work 
outlined a taxonomy of questions that differentiates between algorithm and model performance,
and whether the goal is to estimate accuracy or to choose between models/algorithms.
Within  this  taxonomy,  Dietterich  formulated  the  question  that  is  most  central  to  data  mining 
and  machine  learning  research:  Given  two  learning  algorithms  A  and  B  and  a  dataset  of  size  S 
from  a  domain  D,  which  algorithm  will  produce  more  accurate  classifiers  when  trained  on 
other  datasets  of  size  S  drawn  from  D?  This  question  explicitly  formulates  the  notion  of 
generalization  and  provides  a  means  to  test  the  notion  statistically. 

Within  this  framework,  Dietterich  analyzed  the  characteristics  of  five  statistical  tests  that 
can be used to assess generalization performance and showed that two of the tests in widespread
use (at that time) had a high probability of Type I error (i.e., the tests will likely lead
to an erroneous conclusion of algorithm difference when there is none). Overall, Dietterich's
work  showed  that  the  overlap  in  training/test  sets  combined  with  imbalanced  samples  can 
lead  to  higher  Type  I  errors  due  to  biased  estimates  of  mean  performance  difference  between 
two  algorithms.  Therefore,  a  methodology  that  reduces  the  overlap  between  the  training  and 
test  sets  leads  to  lower  Type  I  errors.  Based  on  this  analysis,  Dietterich  developed  a  novel 
5x2  cross-validation  test,  which  has  lower  Type  I  error  than  the  standard  cross-validation 
test  but  slightly  reduced  statistical  power  (i.e.,  higher  Type  II  error). 

However, Dietterich's work only considered conventional machine learning algorithms
that assume i.i.d. data. In this work, we consider the task of comparing algorithm performance
in the context of within-network relational learning. In contrast to across-network
relational  learning,1  in  which  a  model  is  learned  on  one  network  and  then  applied  to  another 
disconnected  network,  within-network  learning  aims  to  generalize  within  a  single  relational 
data  graph.  In  within-network  learning,  models  are  learned  on  a  partially  labeled  network  and 
then  applied  to  predict  the  class  labels  in  the  remainder  of  the  network  (i.e.,  the  unlabeled 
portion).  In  many  real  world  applications,  relational  learning  tasks  fall  naturally  into  the 
within-network classification setting. For example, in the task of research paper classification,
new papers to be classified usually have citation links to papers in the past whose topics
are  known.  Similarly,  in  fraud  detection,  brokers  whose  fraud  status  is  yet  to  be  determined 
might  associate  with  other  brokers  who  have  already  been  identified  as  fraudulent  or  not. 

Within-network  relational  learning  tasks  have  two  characteristics  that  can  complicate  the 
application  of  conventional  statistical  tests  for  comparing  generalization  performance.  First, 
the  instances  in  the  network  are  not  independent.  Indeed,  relational  learning  algorithms  are 
specifically trying to exploit the dependencies among instances to improve prediction accuracy.
The dependencies among instances, however, tend to result in correlated errors among
the  instances.  These  correlated  errors  can  increase  the  imbalance  between  network  samples, 
and  this  can  lead  to  increased  Type  I  errors.  Second,  the  size  of  the  training  and  test  sets 
are  dependent,  and  thus  as  the  proportion  of  (labeled)  training  data  decreases,  the  size  of 
the test set increases. This results from the fact that the models are learned/applied to a partially
labeled network with varying levels of labeled instances, and the full set of unlabeled
instances is typically used for evaluation. As the size of the unlabeled set increases, the
dependencies between samples increase, and this can also lead to increased Type I errors.

1 We use the more general term relational learning to refer to both within-network and across-network
learning.

Fig. 1 A taxonomy of statistical questions in statistical relational machine learning. The white box contains
Dietterich's original taxonomy of questions for i.i.d. classifiers. All questions in the hierarchy are applicable
to relational classifiers

Figure  1  outlines  a  taxonomy  of  statistical  questions  for  relational  classifiers  that  builds 
on the taxonomy proposed by Dietterich [5]. Our taxonomy includes two additional dimensions
of variation: (1) transfer of knowledge within a single network (i.e., within-network
classification)  versus  across  multiple  networks  (i.e.,  across-network  classification)  and  (2) 
the  proportion  of  known  labels  available  in  the  test  network  (small  vs.  large).  While  the  focus 
of  this  study  is  within-network  classification,  our  findings  will  also  apply  to  across-network 
classification if the test networks are partially labeled. The distinction between a small versus
large proportion of available labels is important because with a sufficiently small label
set  (i.e.,  <50%),  standard  cross-validation  is  no  longer  applicable  as  test  sets  will  start  to 
overlap — leading  to  an  increase  in  Type  I  error.  Our  study  considers  a  range  of  labeling 
proportions,  spanning  the  “small”  and  “large”  label-proportion  categories. 

More specifically, in this paper, we consider the following question: Given two learning
algorithms A and B and a partially labeled network from domain D with S_L labeled
instances and S_U unlabeled instances (S = S_L + S_U), which algorithm will produce more
accurate classifiers when trained on other partially labeled networks of size S drawn from D?
We consider statistical tests that can be used to evaluate this question and discuss the
characteristics of relational data that can complicate accurate evaluation. We show theoretically
how these network characteristics can lead to increased Type I error, then we investigate
the nature of this bias empirically. Our empirical investigation compares the performance
of a number of common statistical tests, using both simulated and real classifiers, and both
synthetic and real datasets. The experimental methodology and results for each combination
can be found in the following sections:

-  Theoretical  Analysis:  Sect.  4 

-  Simulated  Classifiers/Synthetic  Data:  Sect.  5 

-  Real  Classifiers/Synthetic  Data:  Sect.  6 

-  Real  Classifiers/Real  Data:  Sect.  7 

Our findings indicate that a commonly used method of statistical assessment, paired
t-tests on repeated random network samples (each consisting of a labeled training set and an
unlabeled test set), results in unacceptably high levels of Type I error. We propose a method
for network cross-validation that, combined with unpaired t-tests, produces low levels of Type I
error at the expense of reduced statistical power. Combining network cross-validation with
paired t-tests is a good compromise, resulting in both acceptable levels of Type I error and
reasonable levels of statistical power.

This  article  builds  on  preliminary  work  [24].  The  most  significant  additions  for  this  article 
include: 

-  Development  of  a  taxonomy  of  statistical  questions  for  comparing  network  classifiers. 

-  Development  of  a  theoretical  explanation  of  how  relational  data  characteristics  lead  to 
Type I error in standard statistical tests.

Additional  contributions  of  the  overall  work  include: 

-  Formulation  of  important  statistical  questions  for  comparing  network  classifiers. 

-  Discussion  and  demonstration  of  the  challenges  of  relational  data  for  using  conventional 
statistical  tests  to  compare  classifiers. 

-  A proposed solution, network cross-validation, which addresses these challenges.

-  Empirical  evaluation  of  statistical  test  characteristics  (Type  I  error/power)  on  real-world 
and  synthetic  data. 


2  Comparing  classifiers  in  network  domains 

Statistical tests for comparing classifiers generally consist of two parts: (1) the resampling
procedure dictates how the available data are partitioned into training and test sets for estimation
of classifier performance (i.e., how many times is the classifier trained and tested? which
data are used to train the classifier? which data are used to test the classifier?) and (2) the
significance test takes the classification results from the resampling trials and determines
whether observed differences reflect a true difference in classifier performance
or whether they are likely to have occurred by chance alone.

2.1 Resampling procedures

Given a fully labeled network of size S, we consider three resampling procedures to generate
training (labeled set S_L) and test (unlabeled set S_U) sets to evaluate within-network
classification algorithms: simple random resampling (RRS), equal-instance random resampling
(ERS), and network cross-validation (NCV). The first two methods have been used extensively
in past work on relational learning algorithms (see Sect. 3 for more detail). The third
method is a new approach, based on the incremental cross-validation procedure outlined in
Cohen [2] for generating learning curves, which we will show is more robust to Type I error.
Table 1 Simple random resampling procedure

input: network, propLabeled, k
S = total number of instances in network
F = ∅

for fold = 1 to k
    testSet = uniform random sample of ((1 − propLabeled) * S) nodes from network
    trainSet = network − testSet
    F = F ∪ <trainSet, testSet>
end for
output: F


Table  2  Equal-instance  resampling  procedure 

input: network, propLabeled, k
S = total number of instances in network

// Split data into overlapping folds so that each instance occurs in the same number of folds.
testSetSize = (1 − propLabeled) * S
numCopies = (k * testSetSize) / S

pool = sorted list with numCopies of each instance in network
testSet[i] = { }, for i = 1 to k
for instance ∈ pool
    add instance to smallest testSet[i] that does not already contain it
end for

// create training/test splits
F = ∅
for i = 1 to k
    trainSet = network − testSet[i]
    F = F ∪ <trainSet, testSet[i]>
end for
output: F


Tables  1  and  2  outline  the  procedures  for  RRS  and  ERS,  respectively.  Both  methods  involve 
repeated  random  draws  from  the  sample  population  to  generate  the  training/test  splits  and 
therefore  produce  overlapping  test  sets.  However,  ERS  ensures  that  each  instance  in  the 
original  sample  occurs  in  exactly  the  same  number  of  test  sets  in  the  collection  of  resamples. 
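The two procedures in Tables 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the `seed` parameter, the rounding of the test set size, the random tie-breaking in ERS, and the restriction of ERS to the evenly divisible case are all assumptions made for the sketch.

```python
import random


def simple_random_resampling(nodes, prop_labeled, k, seed=None):
    """RRS (Table 1): draw k test sets independently and uniformly at random."""
    rng = random.Random(seed)
    nodes = list(nodes)
    test_size = round((1 - prop_labeled) * len(nodes))
    folds = []
    for _ in range(k):
        test_set = set(rng.sample(nodes, test_size))
        train_set = set(nodes) - test_set
        folds.append((train_set, test_set))
    return folds


def equal_instance_resampling(nodes, prop_labeled, k, seed=None):
    """ERS (Table 2): overlapping test sets in which every node appears equally often."""
    rng = random.Random(seed)
    nodes = sorted(nodes)
    test_size = round((1 - prop_labeled) * len(nodes))
    num_copies, remainder = divmod(k * test_size, len(nodes))
    if remainder:
        # Assumption of this sketch: only the evenly divisible case is handled.
        raise ValueError("k * test_size must be divisible by the number of nodes")
    test_sets = [set() for _ in range(k)]
    for node in nodes:
        for _ in range(num_copies):
            # Place each copy in a smallest test set that does not yet contain the node.
            candidates = [t for t in test_sets if node not in t]
            smallest = min(len(t) for t in candidates)
            # Random tie-breaking among equally small sets is an assumption.
            rng.choice([t for t in candidates if len(t) == smallest]).add(node)
    return [(set(nodes) - t, t) for t in test_sets]
```

With 100 nodes, `prop_labeled = 0.3`, and `k = 10`, every test set holds 70 nodes, so any two RRS test sets must share at least 40 nodes; ERS produces the same amount of overlap but places each node in exactly 7 of the 10 test sets.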

Table 3 outlines the NCV procedure, which eliminates overlap between test sets altogether.
The procedure samples k disjoint test sets that will be used for evaluation. Then, for each
test set fold, the remaining folds are merged together, and the training set of size S_L is
randomly sampled from the merged set. When the training set size is less than the size of the
merged folds (i.e., S_L < (k − 1)S/k), this will leave a set of unlabeled nodes that are neither
in the test set nor the training set. Since these unlabeled instances will likely be connected
to nodes in the test set, we run collective inference over the full set of unlabeled nodes
(the inference set), and then only evaluate model performance on the nodes assigned to the
test set.
Table 3 Network cross-validation procedure

given: network, propLabeled, k
S = total number of instances in network
F = ∅

Split data into k disjoint folds
for fold = 1 to k
    current fold becomes testSet
    remaining folds are merged and become trainPool
    trainSet = uniform random sample of (propLabeled * S) nodes from trainPool
    inferenceSet = network − trainSet
    F = F ∪ <trainSet, testSet, inferenceSet>
end for
output: F


NCV  addresses  a  limitation  of  standard  cross-validation  for  within-network  classification 
tasks. Namely, standard CV forces us to label k − 1 of every k instances. So, if k = 10,
we are forced to experiment with 90% labeled data. The NCV approach accommodates a
lower proportion of labeled instances because it samples a smaller labeled set from the k − 1
non-test  folds.  Since  NCV  applies  collective  inference  to  the  full  unlabeled  portion  of  the 
network,  it  can  fully  exploit  the  network  structure  to  improve  model  predictions.  However, 
since  NCV  only  evaluates  the  model  on  the  disjoint  test  set  instances,  it  will  not  suffer  the 
same  problems  experienced  by  resampling  due  to  overlapping  test  sets. 
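The NCV procedure of Table 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name, the `seed` parameter, the shuffle-and-stride way of forming the k disjoint folds, and the rounding of the training set size are hypothetical choices.

```python
import random


def network_cross_validation(nodes, prop_labeled, k, seed=None):
    """NCV (Table 3): k disjoint test folds, each with a sampled labeled set."""
    rng = random.Random(seed)
    nodes = list(nodes)
    rng.shuffle(nodes)
    # Assumption: form k disjoint folds by striding over the shuffled nodes.
    folds = [nodes[i::k] for i in range(k)]
    train_size = round(prop_labeled * len(nodes))
    splits = []
    for i, test_fold in enumerate(folds):
        # Merge the remaining k - 1 folds into the training pool.
        train_pool = [n for j, fold in enumerate(folds) if j != i for n in fold]
        if train_size > len(train_pool):
            raise ValueError("prop_labeled is too large for k folds")
        train_set = set(rng.sample(train_pool, train_size))
        # Collective inference runs over every node that is not labeled,
        # but only the nodes in the test fold are scored.
        inference_set = set(nodes) - train_set
        splits.append((train_set, set(test_fold), inference_set))
    return splits
```

Each returned triple holds the labeled training set, the disjoint test fold used for scoring, and the larger inference set over which collective inference would be run.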

2.2  Significance  tests 

Once  a  sampling  procedure  is  chosen  to  create  training/test  splits  within  a  network,  the 
algorithms  are  learned  on  each  training  set,  and  then  the  models  are  applied  for  collective 
inference  over  the  associated  unlabeled  portion  of  the  network.  The  predictions  on  the  test 
set  instances  are  evaluated  to  generate  an  estimate  of  algorithm  performance  (e.g.,  accuracy, 
AUC,  squared  loss).  The  training/test  splits  result  in  a  set  of  performance  measurements  for 
each algorithm, and a significance test is then used to determine whether the observed performance
differences are significantly different from what would be expected if the performance
measures  were  drawn  from  the  same  underlying  distribution  (i.e.,  the  algorithms  perform 
equivalently). 

In  this  work,  we  investigated  the  following  three  statistical  tests:  (1)  paired  t-test,  (2) 
unpaired  t-test,  and  (3)  Wilcoxon  signed  rank  test.  Both  the  paired  t-test  and  Wilcoxon  signed 
rank  test  assume  independence  of  the  paired  differences  between  classifiers.  We  observed  no 
substantive  differences  due  to  the  use  of  the  Wilcoxon  test  versus  the  t-test.  Therefore,  we 
focus  on  the  more  commonly  used  t-test  for  the  remainder  of  this  paper. 
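To make the difference between the paired and unpaired t-tests concrete, the sketch below computes both test statistics from per-fold performance scores using only the Python standard library. The function names are hypothetical, the unpaired version assumes the pooled-variance (equal variance) form, and converting a statistic to a p-value (which needs the t distribution) is left out.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """t statistic and degrees of freedom for fold-wise paired differences."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired test needs one score per fold for each algorithm")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    k = len(diffs)
    # Standard error of the mean difference (fails if all differences are equal).
    std_err = statistics.stdev(diffs) / math.sqrt(k)
    return statistics.mean(diffs) / std_err, k - 1


def unpaired_t_statistic(scores_a, scores_b):
    """Pooled-variance two-sample t statistic and degrees of freedom."""
    n_a, n_b = len(scores_a), len(scores_b)
    pooled_var = (
        (n_a - 1) * statistics.variance(scores_a)
        + (n_b - 1) * statistics.variance(scores_b)
    ) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    diff = statistics.mean(scores_a) - statistics.mean(scores_b)
    return diff / std_err, n_a + n_b - 2
```

For hypothetical fold accuracies `[0.80, 0.82, 0.78, 0.81, 0.79]` and `[0.78, 0.80, 0.77, 0.79, 0.76]`, the paired statistic is about 6.32 (4 degrees of freedom) while the unpaired statistic is 2.0 (8 degrees of freedom). The paired test is more sensitive because fold-to-fold variation shared by both algorithms cancels in the differences.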


3  Survey  of  previous  methodology 

Over  the  past  10  years,  there  has  been  a  great  deal  of  work  on  classifiers  for  relational  domains. 
Here,  we  survey  the  methodological  design  of  23  research  papers  most  relevant  to  our  work 
[1,7-12,16-21,25-29,31,34,35,38,40].  Relevant  papers  compare  the  performance  of  two 
or  more  classifiers  on  a  relational  classification  task.  Based  on  our  survey,  there  are  two 
common  evaluation  methodologies  which  emerge: 
Independent set size The salient feature of the independent set size methodology is that there
is no dependency between training and test set sizes. The task may be either within-network
or across-network classification (i.e., there may or may not be relations between instances
in the labeled and unlabeled set). For across-network classification, classifiers are generally
trained on a fully labeled training network and then evaluated on a disjoint (partially labeled)
test network. In this case, the proportion of labeled data available in the training and test
networks may be varied independently. This means that there is no dependence between training
and test set sizes. For within-network classification, there is a single network, and classifiers
are trained on the labeled instances and evaluated on the unlabeled instances. So, the only
way to achieve independent set sizes in within-network classification is to fix the proportion
of labeled data available in the network. All of the papers in our survey that employ
independent set sizes (15 out of the 23) use some form of random resampling or cross-validation
for evaluation.

Dependent  set  size  This  methodology  applies  to  within-network  classification  only,  where 
training  and  test  sizes  are  dependent.  Any  change  in  the  labeling  proportion  over  the  network 
will  affect  the  number  of  instances  available  for  both  training  and  testing.  All  of  the  papers 
in  our  survey  that  employ  dependent  set  sizes  (9  out  of  the  23)  use  some  sort  of  random 
resampling  for  evaluation  (i.e.,  test  sets  overlap).  Note  that  standard  cross-validation  is  not 
possible  here  since  it  assumes  a  fixed  proportion  of  labeled  data. 

Both of the above methodologies generally address the statistical question that we consider
in our work: Given two learning algorithms A and B and a partially labeled network
from domain D with S_L labeled instances and S_U unlabeled instances (S =
S_L + S_U), which algorithm will produce more accurate classifiers when trained on other
partially labeled networks of size S drawn from D? However, the independent set size methodology
makes a rather strong assumption about the value of S_L. In particular, most studies
that employ independent set sizes use 10-fold cross-validation, which means that their
results only generalize to graphs where 90% of the data are labeled to begin with. This is
limiting because: (1) most interesting real-world problems have far less than 90% of data
labeled to begin with and (2) many algorithms that perform well at 90% labeled will perform
poorly at sparser labelings (e.g., 10%). The dependent set size methodology is more
powerful since it generalizes over different values of S_L, so we focus on this version in our
work.

Table 4 provides a summary of related work along a number of methodological dimensions.
Note that counts in each section do not necessarily sum to 23 since papers may fit in more
than  one  category.  The  majority  of  the  studies  in  our  survey  (13/23)  make  use  of  resampling 
procedures  that  produce  overlap  between  test  sets.  This  includes  all  resampling  procedures 
except  cross-validation  and  temporal  sampling.  Temporal  sampling  involves  training  on  past 
instances  (e.g.,  previously  published  papers  with  known  topics)  and  evaluating  on  present 
instances  (e.g.,  a  newly  submitted  paper  with  unknown  topic).  Controlled  random  sampling 
procedures  attempt  to  control  or  account  for  the  amount  of  overlap  between  test  sets  (e.g.,  as 
in  the  equal-instance  resampling  procedure  described  in  Sect.  2.1). 

In our survey, within-network classification tasks are more common than across-network
tasks (13 papers vs. 8). However, in a substantial number of cases, it is unclear
exactly  how  the  experiments  are  set  up.  For  example,  authors  will  often  say  something 
like:  we  split  the  network  into  training  and  test  sets.  It  is  not  clear  from  this  description 
whether  the  links  are  preserved  between  instances  in  the  training  set  and  instances  in  the 
test  set.  In  other  cases,  authors  are  explicit  regarding  whether  such  links  are  retained  or 
removed. 
Table 4 Experimental characteristics of 23 surveyed papers

Resampling procedure                          Systematic variation of % labeled
  Cross validation            14                No                        13
  Simple random                8                Yes                       10
  Controlled random            3                  Within-network           8
  Snowball sampling            2                  Across network           2
  Temporal resampling          2

Statistical test                              Number of resampling folds
  t-test                      10                10                        14
  StDev/Var/StErr              6                <10                        7
  None                         6                >10                        2
  Wilcoxon signed rank         2                Unspecified                2

Within versus across network classification   Performance measure
  Within-network              13                Accuracy                  14
  Across network               8                AUC                       10
  Unspecified                  6                Precision/Recall/F1        1

Note that section counts do not necessarily sum to 23 since papers can fit in >1 category


The  majority  of  studies  in  our  survey  (13/23)  do  not  vary  the  proportion  of  labeled  data 
available.  Of  the  studies  that  do  vary  the  proportion  of  labeled  data,  most  (8/10)  are  within- 
network  studies  (i.e.,  dependent  set  size). 

The  most  common  number  of  resampling  folds  used  in  the  surveyed  papers  is  10.  As 
Dietterich  notes  in  his  original  study,  the  probability  of  Type  I  error  for  random  resampling 
procedures  increases  with  the  number  of  resampled  folds  [5].  Our  simulation  experiments 
confirm  this  finding,  but  we  do  not  replicate  the  result  here. 

Half  of  the  studies  in  our  survey  do  not  make  use  of  an  explicit  significance  test.  However, 
of these, about half do report standard deviation, variance, or standard error (StDev/Var/StErr
in  Table  4).  Finally,  both  accuracy  and  AUC  are  common  measures  of  classifier  performance, 
with  precision/recall-based  measures  being  much  less  common. 

In  addition  to  the  related  work  discussed  above,  there  have  been  other  works  on  learning 
and  generalization  bounds  from  non-i.i.d.  observations  (e.g.  [3,4,6,14,15,22,23,30,32,33, 
37,39]).  However,  none  of  them  address  the  problems  of  (1)  dependence  between  related 
instances and (2) dependence between training and test set sizes. Usunier et al. [36] proposed
a new framework to study the generalization properties of classifiers over data which
can  exhibit  a  suitable  dependency  structure.  However,  their  focus  was  solely  on  dependent 
training  sets. 


4  Characteristics  leading  to  bias  in  statistical  tests 

As discussed previously, there are two characteristics of within-network relational classification
that can lead to increased levels of Type I error:

1. Inter-instance dependencies lead to correlated errors.

2.  Small  training  sets  lead  to  large  test  sets,  increasing  the  dependence  between  samples. 
Table 5 Error correlation and relational autocorrelation in real-world classification tasks

Data set          Task               Error corr.   Autocorr.
Enron email       Executive?         0.18          0.17
Citeseer          Neural nets?       0.23          0.59
Political books   Neutral?           0.25          0.22
Cora              Info. retrieval?   0.28          0.61
Reality mining    In study?          0.32          0.79
Reality mining    Student?           0.52          0.91

Here, we will discuss these two issues in more detail, providing both empirical and theoretical
support for our claims.

First, we demonstrate that within-network classifiers produce correlated errors. To test this
conjecture, we experimented with
several relational classifiers and real-world classification tasks, using the φ coefficient to
measure the correlation of 0-1 errors over all pairs of related (i.e., linked) instances. We used
a non-learning relational neighbor classifier [18] and a learning link-based classifier [17].
We  ran  each  classifier  both  with  and  without  collective  inference  on  a  number  of  prediction 
tasks.  Table  5  shows  the  amount  of  measured  error  correlation  for  each  task,  averaged  over 
all  classifiers  and  proportions  of  labeled  data  (0.1,  0.3,  0.5,  0.7,  0.9).  Although  we  report 
averages,  we  should  note  that  all  trials  (i.e.,  tasks/classifiers)  exhibited  some  degree  of  error 
correlation.  We  can  observe  that  the  level  of  error  correlation  is  correlated  with  the  level  of 
relational  autocorrelation  (see  e.g.,  [26])  in  the  class  label.  Since  autocorrelation  has  been 
shown  to  be  essentially  ubiquitous  in  relational  data,  this  suggests  that  error  correlation  is 
widespread  as  well. 
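The error-correlation measurement described above can be sketched as a phi coefficient computed from a 2x2 table of 0-1 errors at the two ends of each link. The function name and the input layout (a dict of per-node errors and a list of links) are hypothetical, and counting each undirected link in both directions is an assumption made so that the table is symmetric; the original study may have handled pairs differently.

```python
import math


def phi_error_correlation(errors, edges):
    """Phi coefficient of 0-1 errors over linked pairs of instances.

    errors: dict mapping node -> 0 (correct) or 1 (misclassified)
    edges:  iterable of (u, v) links between nodes
    """
    n11 = n10 = n01 = n00 = 0
    for u, v in edges:
        # Assumption: count every undirected link in both directions.
        for a, b in ((errors[u], errors[v]), (errors[v], errors[u])):
            if a and b:
                n11 += 1
            elif a:
                n10 += 1
            elif b:
                n01 += 1
            else:
                n00 += 1
    row1, row0 = n11 + n10, n01 + n00
    col1, col0 = n11 + n01, n10 + n00
    denom = math.sqrt(row1 * row0 * col1 * col0)
    if denom == 0:
        raise ValueError("phi is undefined when all errors take the same value")
    return (n11 * n00 - n10 * n01) / denom
```

A value near 1 means that linked instances tend to be misclassified together, a value near 0 means errors on linked instances are unrelated, and a negative value means errors tend to alternate across links.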

Next, we show theoretically how error correlation increases the variance of average error.
This increase in variance, if not accounted for in the statistical test, will lead to increased Type I
error.

Theorem 1 (Correlated errors increase variance) Let $X_i$ be a random variable that represents
the classification error for an instance $i$ and assume that each r.v. $X_i$ is drawn from the
same distribution with mean $\mu$ and variance $\sigma^2$. Let $\bar{X} = \frac{1}{n} \sum_{X_i \in S} X_i$ be a random variable
that represents the average classification error in a sample $S = \{X_i\}$ with $n$ instances (i.e.,
$|S| = n$). Assume that the sample $S$ consists of $n$ instances drawn equally from a set of
groups $G$, where $|G| = m$ and each group $G_k \in G$ has average size $g$ (i.e., $m \cdot g = n$).
Let $\rho$ be the average correlation between the $X_i$ that are members of the same group,
otherwise assume the $X_i$ are independent. Then the average classification error $\bar{X}$ has
variance $Var(\bar{X}) = \frac{1}{n}\sigma^2 [1 + \rho(g - 1)]$.

Proof 


Var(X)  =  Var 
We refer to this variance of the average error, when there is error correlation, as $Var_{corr}(\bar{X})$.
Note that other than for very specific graph structures (e.g., bipartite graphs), if relational
data are correlated, autocorrelation is positive and $\rho$ will be greater than zero. Thus, as $\rho$
or $g$ (i.e., group size) increases, $Var_{corr}(\bar{X})$ also increases. If this correlation is ignored and
error is estimated under the assumption of independence (i.e., with $\frac{1}{n}\sigma^2$), then the variance
will be underestimated by $\rho(g - 1)\left[\frac{1}{n}\sigma^2\right]$.
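The variance stated in Theorem 1 can be recovered with a short calculation. The following is a sketch consistent with the theorem statement, under the simplifying assumptions that every group has exactly $g$ members and that each within-group pair has covariance $\rho\sigma^2$; it is not necessarily the authors' original argument.

```latex
\begin{align*}
Var(\bar{X}) &= Var\left(\frac{1}{n}\sum_{X_i \in S} X_i\right)
  = \frac{1}{n^2}\left[\sum_{i} Var(X_i) + \sum_{i \neq j} Cov(X_i, X_j)\right] \\
 &= \frac{1}{n^2}\left[n\sigma^2 + m\,g(g-1)\,\rho\sigma^2\right]
  = \frac{1}{n^2}\left[n\sigma^2 + n(g-1)\,\rho\sigma^2\right] \\
 &= \frac{1}{n}\sigma^2\left[1 + \rho(g-1)\right]
\end{align*}
```

The second line uses the fact that each of the $m$ groups contributes $g(g-1)$ ordered within-group pairs and that $m \cdot g = n$. As a numeric illustration, with $g = 5$ and $\rho = 0.3$ the bracketed factor is $1 + 0.3 \cdot 4 = 2.2$, so the variance of the average error is more than twice what an independence assumption would predict.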

In  Sect.  2.1,  we  described  how  existing  resampling  procedures  create  dependent  (i.e., 
overlapping) test sets. Here, we show theoretically how overlapping samples decrease the
variance  of  observed  error.  For  the  proof  below,  we  consider  the  case  of  independent  data 
instances  and  show  the  effect  on  variance  when  a  set  of  instances  appears  repeatedly  in  each 
sample  (i.e.,  test  set).  Then,  in  the  next  theorem,  we  will  show  how  the  effect  of  overlap 
combines  with  the  effect  due  to  correlated  instances  and  together  lead  to  bias  in  statistical 
tests. 


Theorem 2 (Overlap in samples reduces variance) Let $X_i$, $\bar{X}$, and $S$ be defined as in
Theorem 1. Assume that every sample $S$ has a subset $S'$ of independent r.v.s and a subset $C$ of
common r.v.s, where the set $S'$ is of size $n - c$ and the set $C$ consists of $c$ r.v.s with unknown
values, but the values do not vary across samples (i.e., $S = S' + C = \{X_i\}_{n-c} + \{x_i\}_c$). Then
$Var(\bar{X}) = \frac{1}{n}\sigma^2 \left[1 - \frac{c}{n}\right]$.
Proof
We refer to this variance of the average error, when there is overlap between samples, as
$Var_{ovlp}(\bar{X})$. As $c$ increases, $Var_{ovlp}(\bar{X})$ decreases. Note that for samples with $n$ independent
instances, the variance of the average performance is $\frac{1}{n}\sigma^2$; thus, when overlap is ignored,
the variance will be underestimated by $\frac{c}{n^2}\sigma^2$.
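The overlap result can also be checked with a short calculation. The following is a sketch consistent with the statement of Theorem 2, treating the $c$ common values as constants across samples and the remaining $n - c$ errors as independent with variance $\sigma^2$; it is not necessarily the authors' original argument.

```latex
\begin{align*}
Var(\bar{X}) &= \frac{1}{n^2}\,Var\left(\sum_{X_i \in S'} X_i + \sum_{x_i \in C} x_i\right)
  = \frac{1}{n^2}\sum_{X_i \in S'} Var(X_i) \\
 &= \frac{(n-c)\,\sigma^2}{n^2}
  = \frac{1}{n}\sigma^2\left[1 - \frac{c}{n}\right], \\
\frac{1}{n}\sigma^2 - Var(\bar{X}) &= \frac{c}{n^2}\sigma^2 .
\end{align*}
```

The constant terms in $C$ contribute nothing to the variance across samples, which is why only $n - c$ terms survive. As a numeric illustration, with $n = 100$ and $c = 40$ the observed variance is $0.6$ times the variance expected from fully independent samples.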

Finally,  we  show  how  these  two  effects  combine  together  to  bias  conventional  statistical 
tests  for  network  domains. 

Theorem 3 (Sample overlap and correlation increase likelihood of Type I error) Let algorithm
A and algorithm B have equal classification error rates on data drawn from the
same domain D. Let $X_i$ be the classification error for an instance $i$ in domain D and
assume that $X_i^A$ and $X_i^B$ (the errors made by algorithms A and B, respectively) are drawn
from the same distribution with mean $\mu$ and variance $\sigma^2$. Furthermore, assume that a dataset
of $n$ instances from D consists of a set of groups $G$, where $|G| = m$ and each group
$G_k \in G$ has average size $g$ (i.e., $m \cdot g = n$). Let $\rho$ be the average correlation between
the $X_i$ that are members of the same group, otherwise assume the $X_i$ are independent. Let
$\bar{X}^A = \{\bar{X}_1^A, \bar{X}_2^A, \ldots, \bar{X}_k^A\}$ and $\bar{X}^B = \{\bar{X}_1^B, \bar{X}_2^B, \ldots, \bar{X}_k^B\}$ be the estimates of average error
for $s = [1, k]$ samples drawn with replacement (i.e., repeated sampling) from a given dataset
of size $n$ in the domain D. Assume that there is a set of instances $C$, where $|C| = c$, that
is common to all $k$ samples and assume otherwise the samples are independent. Then an
unpaired t-test over $\bar{X}^A$ and $\bar{X}^B$ will underestimate the variance of the null distribution by:

A  =  ^2[(f)(l+Pte-D[^  +  ^])] 

Proof The observed variance $\mathrm{Var}(\bar{X})$ will have effects due to both instance correlation and
overlapping samples:

$$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{X_j \in S} X_j\right)$$

We refer to this variance of the average error, when there is both overlap between samples
and error correlation, as $\mathrm{Var}_{obs}(\bar{X})$. The unpaired t-test uses the average (i.e., pooled)
variance of $\bar{X}^A$ and $\bar{X}^B$ for the null distribution. Since the error distributions of A and B
are equal, the average is equal to the variance of a single algorithm. When the samples have
independent instances but $c$ instances are common to each sample, we know from Theorem 2
that the variance of $\bar{X}$ will be the following: $\mathrm{Var}_{ovlp}(\bar{X}) = \frac{1}{n}\sigma^2\left[1 - \frac{c}{n}\right]$. However,
when there is correlation $\rho$ among the instances in the data but the samples themselves
are independent, from Theorem 1, we know that the variance of $\bar{X}$ is the following:
$\mathrm{Var}_{corr}(\bar{X}) = \frac{1}{n}\sigma^2\left[1 + \rho(g - 1)\right]$.

Since the two algorithms are equal, the variance of the null distribution should correspond
to the variance with independent samples $\mathrm{Var}_{corr}(\bar{X})$. However, the observed variance
$\mathrm{Var}_{obs}(\bar{X})$ will be used in the t-test, resulting in an underestimate of $\Delta$:

$$\Delta = \mathrm{Var}_{corr}(\bar{X}) - \mathrm{Var}_{obs}(\bar{X})$$
As $\rho$ (the amount of error correlation) or $c$ (the amount of overlap between samples) increases,
the amount of underestimation (i.e., $\Delta$) increases. This increases the probability of a Type I
error in the following way. For unpaired tests, the t-statistic is:

$$t = \frac{\bar{X}^A - \bar{X}^B}{\sqrt{\mathrm{Var}(\bar{X})} \cdot \sqrt{2/k}}$$

where $\mathrm{Var}(\bar{X})$ is the pooled variance. Since $\mathrm{Var}_{obs}(\bar{X}) < \mathrm{Var}_{corr}(\bar{X})$, the result will be
that $t_{obs} > t_{corr}$ and thus $P(t_{obs}|T) < P(t_{corr}|T)$, where $T$ is the t distribution with $2k - 2$
degrees of freedom. Thus, using $\mathrm{Var}_{obs}(\bar{X})$ instead of $\mathrm{Var}_{corr}(\bar{X})$, it is more likely that the
null hypothesis will be rejected even when it holds, and as such Type I error will increase.

This effect will impact paired t-tests in a similar way, as the decrease in observed variance
of $\bar{X}^A$ and $\bar{X}^B$ will also result in an underestimate of the difference variance
$\mathrm{Var}(\bar{X}^A - \bar{X}^B)$, which is used instead of the pooled variance.
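
To make the mechanism concrete, the sketch below evaluates the variance expressions from Theorems 1 and 2 and shows how a smaller variance estimate inflates the unpaired t-statistic. It is a hedged illustration: the function names and all numeric settings (rho = 0.2, c = 150, k = 10, a mean difference of 0.02) are hypothetical, the t-statistic uses the standard equal-sample-size pooled form, and `var_ovlp` is used only as a stand-in for an underestimated variance, not as the observed variance derived in the proof.

```python
import math


def var_corr(n, sigma2, rho, g):
    """Theorem 1: variance of the mean with within-group correlation rho."""
    return sigma2 / n * (1 + rho * (g - 1))


def var_ovlp(n, sigma2, c):
    """Theorem 2: variance of the mean when c instances are common to all samples."""
    return sigma2 / n * (1 - c / n)


def unpaired_t(mean_a, mean_b, pooled_var, k):
    """Unpaired t-statistic for two sets of k sample averages."""
    return (mean_a - mean_b) / (math.sqrt(pooled_var) * math.sqrt(2 / k))


n, g, k = 300, 30, 10        # 300 instances in 10 groups of average size 30
sigma2 = 0.1 * (1 - 0.1)     # variance of a 0-1 error with error rate 0.1
v_corr = var_corr(n, sigma2, rho=0.2, g=g)
v_ovlp = var_ovlp(n, sigma2, c=150)

# The same observed difference looks far more extreme under the smaller variance.
t_corr = unpaired_t(0.12, 0.10, v_corr, k)
t_under = unpaired_t(0.12, 0.10, v_ovlp, k)
```

With these settings the ratio `t_under / t_corr` equals the square root of `v_corr / v_ovlp`, so the inflation of the statistic depends only on how badly the variance is underestimated.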


5  Empirical  investigation  with  simulated  classifiers 

Type I errors occur when a statistical test incorrectly rejects the null hypothesis (i.e., the test
concludes that there is a significant difference between two classifiers when there is none).
We  run  simulation  experiments  in  order  to  assess  the  contributions  of  two  key  factors  in  Type  I 
errors  for  within-network  classification:  (1)  correlation  of  errors  among  related  data  instances 
and  (2)  dependence  between  samples.  We  also  assess  the  potential  of  various  statistical  tests 
to  produce  Type  I  errors  in  a  within-network  classification  setting. 

Our  method  preserves  the  basic  structure  of  Dietterich’s  [5],  but  introduces  a  group-based 
model to more easily represent sets of related instances and vary the degree of error
correlation among them. We also ran Dietterich’s original procedure with qualitatively similar
results. 

5.1  Methodology 

As we have seen, real classifiers exhibit correlated errors on sets of related instances. To
simulate this behavior, we divide all data instances into disjoint groups such that classification
errors  are  more  likely  to  be  correlated  on  instances  within  a  group  than  on  instances  from 
different  groups. 
Table 6 Simulation algorithm

for simulation 1 to 10
    for trial 1 to 1000
        (NCV*) Create sample and split into 10 disjoint folds
        for propLabeled ∈ (0.1, 0.3, 0.5, 0.7, 0.9)
            (RS*) Create sample and resample 30 folds
            for each fold
                run groupBasedClassification(fold)
            end for each
            Apply significance test to all folds in trial/propLabeled
            Accept or reject null hypothesis
            Measure error correlation for this trial/propLabeled
        end for
    end for
    Calculate null hypothesis rejection rate for simulation
    Calculate mean error correlation for this simulation
end for
Calculate final mean rejection rate and error correlation

NCV*: Performed for NCV (see Table 3) only.
RS*: Performed for RRS and ERS (see Tables 1-2) only.
See Table 7 for groupBasedClassification procedure.


Table  6  outlines  our  simulation  algorithm  for  both  the  resampling  procedures  and  the 
network  cross-validation  procedure.  The  two  procedures  differ  only  in  that:  (1)  they  use 
different  resampling  algorithms  to  create  their  test  sets  and  (2)  NCV  uses  the  same  samples 
and  folds  across  all  proportions  of  labeled  data,  whereas  the  resampling  procedures  choose 
a  different  sample  and  random  split  for  each  trial  and  proportion  labeled. 

We  simulate  drawing  a  network  sample  s  from  an  underlying  population  by  creating  300 
instances  and  assigning  one  of  10  groups  to  each  instance  uniformly  at  random  (a  skewed 
group-size  distribution  produced  qualitatively  similar  results).  We  then  resample  s  and  run  a 
simulated  classification  experiment  on  each  resampled  train/test  split  (see  Table  7).  For  each 
trial,  we  apply  a  significance  test  to  either  accept  or  reject  the  null  hypothesis.  We  calculate 
the  proportion  of  trials  for  which  the  null  hypothesis  was  rejected.  Since  the  simulation  is 
designed  so  each  classifier  has  the  same  error  rate  in  the  underlying  population,  any  rejections 
of  the  null  hypothesis  represent  Type  I  errors.  In  addition,  we  measure  the  degree  of  error 
correlation for each trial. We use the φ coefficient to measure the pairwise correlation of 0-1
errors among all instances in the same group, averaged over all groups.

5.2  Results 

For all experiments, we present average Type I error rates for various statistical tests over
10 simulations of 1000 trials each, on data samples of size 300 instances. Unless otherwise
noted, our default experimental parameters are classifier error rate P(err) = 0.1, error
correlation parameter errCorr = 0.9, and proportion of labeled instances propLabeled = 0.9.
The errCorr parameter determines the likelihood of misclassifying instances in the chosen
misclassification group versus instances in other groups.
Table 7 Group-based classifier simulation algorithm

groupBasedClassification(fold)
    NG = total number of groups
    M = round(NG * P(err))
    P(MG) = P(err) + errCorr * (1 - P(err))
    P(MG') = P(err) * (1 - P(MG)) / (1 - P(err))
    // a real classifier would normally train here, but there is no training phase since we are
    // simulating classification
    // choose groups to misclassify
    MG_A = random set of M groups from 1 to NG/2
    MG_B = random set of M groups from NG/2 + 1 to NG
    for each instance i ∈ fold
        // simulate application of classifier A
        if group(i) ∈ MG_A
            i misclassified by classifier A with P(MG)
        else
            i misclassified by classifier A with P(MG')
        end if
        // simulate application of classifier B
        if group(i) ∈ MG_B
            i misclassified by classifier B with P(MG)
        else
            i misclassified by classifier B with P(MG')
        end if
    end for each

This method ensures that classifiers A and B have the same error rate, while still making different kinds of
errors (i.e., A misclassifies different groups from B). M is the number of groups chosen for misclassification by
each classifier. P(MG) is the misclassification probability for instances of the chosen groups (MG_A/MG_B)
and P(MG') is the misclassification probability for instances of all other groups. P(err) is the overall error
rate of both classifiers A and B and errCorr controls the degree of error correlation among instances within
MG_A/MG_B.
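
The procedure in Table 7 can be sketched in Python as follows. This is a minimal illustration, not the code used for the experiments: the function signature, the `group_of` mapping, and the seeded `rng` argument are hypothetical, and the expression for the probability outside the chosen groups is an assumption chosen so that the overall error rate stays near P(err) when a fraction P(err) of the instances falls in the chosen groups. The usage lines use equal-sized groups for simplicity, whereas the simulation described here assigns groups uniformly at random.

```python
import random


def group_based_classification(fold, group_of, num_groups, p_err, err_corr, rng):
    """Simulate 0-1 errors of classifiers A and B on one fold.

    fold: iterable of instance ids; group_of: dict id -> group in 1..num_groups.
    Returns two dicts mapping instance id -> 1 (misclassified) or 0 (correct).
    """
    m = round(num_groups * p_err)
    p_mg = p_err + err_corr * (1 - p_err)
    # Assumption: chosen so the expected overall error rate is close to p_err.
    p_other = p_err * (1 - p_mg) / (1 - p_err)
    half = num_groups // 2
    # A and B pick their misclassified groups from disjoint halves.
    mg_a = set(rng.sample(range(1, half + 1), m))
    mg_b = set(rng.sample(range(half + 1, num_groups + 1), m))
    errors_a, errors_b = {}, {}
    for i in fold:
        p_a = p_mg if group_of[i] in mg_a else p_other
        p_b = p_mg if group_of[i] in mg_b else p_other
        errors_a[i] = int(rng.random() < p_a)
        errors_b[i] = int(rng.random() < p_b)
    return errors_a, errors_b


rng = random.Random(0)
num_groups = 10
group_of = {i: i % num_groups + 1 for i in range(300)}  # equal-sized groups
rates_a, rates_b = [], []
for _ in range(200):
    errors_a, errors_b = group_based_classification(
        range(300), group_of, num_groups, p_err=0.1, err_corr=0.9, rng=rng)
    rates_a.append(sum(errors_a.values()) / 300)
    rates_b.append(sum(errors_b.values()) / 300)
mean_rate_a = sum(rates_a) / len(rates_a)
mean_rate_b = sum(rates_b) / len(rates_b)
```

Averaged over many runs, both simulated classifiers should show an error rate close to 0.1 while concentrating their errors in different groups.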

Figure 2a shows the effects of varying the proportion of labeled data available (0.1, 0.3, 0.5,
0.7, 0.9). For both resampling procedures, the Type I error rate increases as propLabeled
decreases.  This  result  is  expected  since  the  degree  of  overlap  between  test  sets  increases 
as  the  test  sets  become  larger  due  to  the  larger  number  of  unlabeled  instances.  Since  NCV 
disallows  overlapping  test  sets  by  design,  it  is  not  susceptible  to  this  problem,  achieving  low 
Type  I  error  rates  across  the  range  of  propLabeled  values. 

Figure 2b shows the effects of measured error correlation on Type I error rate as we vary
P(err) = [0.1, 0.2, 0.3, 0.4] and errCorr = [0, 0.2, 0.4, ..., 1.0]. We can observe that the
Type I error rate of the resampling procedures increases as the error correlation increases. This
is not surprising since increased error correlation is expected to lead to increased imbalance
between samples. In addition, we note that our experiments showed that, for a fixed value of
P(err) (errCorr), both the measured Type I error rate and the measured error correlation
increased monotonically with errCorr (P(err)). Overall, NCV is less affected by imbalanced
samples since test sets do not overlap, so it exhibits much lower levels of Type I error.
The Type I error rates of equal-instance resampling (ERS) are lower than those of simple random
resampling; however, since the improvement is not sufficient to make it competitive with
NCV, we do not consider ERS further.

Fig. 2 Results for simulated classifiers on synthetic data. (a) Type I error rates as the
proportion of labeled data increases. (b) Type I error rates as error correlation among related
instances increases.


6  Empirical  investigation  with  real  classifiers 

This section describes our investigation of the characteristics of statistical tests when
comparing real relational learning algorithms. We consider two collective inference models and
compare their performance on synthetic data. The synthetic data generation enables the
simulation of multiple draws of networks from the same distribution. We evaluate the Type I
error  and  power  of  statistical  tests,  as  the  performance  of  the  two  models  is  varied. 

6.1  Models 

In  order  to  run  experiments  at  the  scale  needed  to  assess  Type  I  error  and  power  rates,  we 
choose  to  investigate  two  simple  and  efficient  collective  models  described  in  Macskassy  and 
Provost  [19]. 

The first model is the weighted-vote relational neighbor (wvRN), which estimates class
label probabilities by assuming the existence of homophily. Given the unlabeled nodes in a
network $v_i \in V_U$, wvRN estimates $P(y_i | N_i)$ as the average of the class probabilities of the
instances in $N_i$ ($v_i$'s neighbors): $P(y_i = + | N_i) = \frac{1}{Z}\sum_{v_j \in N_i} P(y_j = + | N_j)$, where $Z$ is a
normalizing constant.

The second model is the network-only Bayes classifier (nBC), which estimates class label
probabilities for $v_i$ with a multinomial naive Bayesian model, based on the classes of $v_i$'s
neighbors $N_i$: $P(y_i = + | N_i) = \frac{P(N_i | +) P(+)}{P(N_i)} \propto \left[\frac{1}{Z}\prod_{v_j \in N_i} P(y_j = \tilde{y}_j | y_i = +)\right] P(+)$, where
$Z$ is a normalizing constant and $\tilde{y}_j$ is the class observed at node $v_j$.

The  wvRN  model  does  not  require  any  learning.  To  estimate  the  parameters  of  the  nBC 
model,  we  use  maximum  likelihood  estimation  over  the  labeled  part  of  the  network.  For 
collective  inference,  we  use  relaxation  labeling  with  both  models. 
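
As a rough illustration of how wvRN estimates propagate through a partially labeled network, the sketch below iterates the neighbor-averaging update until it settles. It is a simplified stand-in rather than the relaxation labeling procedure used in the experiments: it assumes unit edge weights (so Z is the number of neighbors), synchronous updates, no decay schedule, and a hypothetical starting prior of 0.5; the function name `wvrn_estimates` and the toy chain graph are also assumptions.

```python
def wvrn_estimates(neighbors, labels, prior=0.5, iterations=50):
    """Iteratively estimate P(y = +) for unlabeled nodes by neighbor averaging.

    neighbors: dict node -> list of neighboring nodes.
    labels: dict node -> 1 (positive) or 0 (negative), labeled nodes only.
    """
    prob = {v: float(labels[v]) if v in labels else prior for v in neighbors}
    for _ in range(iterations):
        updated = dict(prob)
        for v, nbrs in neighbors.items():
            if v in labels or not nbrs:
                continue  # labeled nodes stay fixed; isolated nodes keep the prior
            # Average of the neighbors' current estimates (Z = number of neighbors).
            updated[v] = sum(prob[u] for u in nbrs) / len(nbrs)
        prob = updated
    return prob


# Toy chain 0 - 1 - 2 - 3 with a positive label at node 0 and a negative one at node 3.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
probs = wvrn_estimates(chain, labels={0: 1, 3: 0})
```

On this chain the estimates for the two unlabeled nodes converge to 2/3 and 1/3, reflecting their distance from each labeled endpoint.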
6.2 Data

The synthetic datasets are generated with a latent group model [25]. Each network is
generated with 300 nodes (instances). The nodes are generated as members of (hidden) groups,
and group membership determines the binary class label values and link existence for each
node. The average group size is 10, and there are two types of groups: A and B. The network
is skewed towards A groups, $P(A) = 0.75$. Members of A groups are more likely to have
a positive class label, $P(+|A) = 0.9$. Members of A groups also have higher intra-group
linkage with $P_A(e_{ij} = 1 | i \in g_A \wedge j \in g_A) = 0.6$ and lower inter-group linkage with
$P_A(e_{ij} = 1 | i \in g_A \wedge j \notin g_A) = 0.003$. Members of B groups are more likely to have a
negative class label, $P(+|B) = 0.1$. Members of B groups have relatively lower intra-group
linkage with $P_B(e_{ij} = 1 | i \in g_B \wedge j \in g_B) = 0.4$ and higher inter-group linkage with
$P_B(e_{ij} = 1 | i \in g_B \wedge j \notin g_B) = 0.013$. The resulting networks have an average
autocorrelation of 0.40, an average error correlation of 0.20, and a class prior of $P(+) = 0.70$.

Our  choice  of  data  generation  parameters  was  designed  to  create  networks  where  the 
wvRN  and  nBC  would  make  different  classification  errors.  Many  of  the  nodes  in  type  B 
groups  have  more  links  to  nodes  in  type  A  groups  so  the  networks  do  not  fully  meet  the 
assumption  of  homophily  which  underlies  the  wvRN  model.  The  nBC  should  thus  more 
accurately learn how to classify the type B nodes, while the wvRN will likely be more
accurate on the type A nodes. However, since wvRN does not learn the concept of homophily, it
will  not  experience  variance  due  to  small  labeled  sets. 
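
The data generation parameters can be sanity-checked with a little arithmetic. The sketch below is only an approximation under stated assumptions: every group has exactly 10 members, and the inter-group link probability of a node's own group type is applied to every node outside its group (how links between an A node and a B node are resolved is not specified here). The variable and function names are hypothetical.

```python
group_size = 10
num_nodes = 300
p_type_a = 0.75
p_pos = {"A": 0.9, "B": 0.1}        # P(+ | group type)
p_intra = {"A": 0.6, "B": 0.4}      # link probability within the node's own group
p_inter = {"A": 0.003, "B": 0.013}  # link probability to nodes outside the group

# Class prior implied by the group-type mix.
class_prior = p_type_a * p_pos["A"] + (1 - p_type_a) * p_pos["B"]


def expected_degrees(group_type):
    """Expected numbers of intra-group and inter-group links for one node."""
    intra = (group_size - 1) * p_intra[group_type]
    inter = (num_nodes - group_size) * p_inter[group_type]
    return intra, inter


intra_a, inter_a = expected_degrees("A")
intra_b, inter_b = expected_degrees("B")
```

Under these assumptions the implied class prior is 0.70, an A node expects about 5.4 links inside its group against fewer than 1 outside, and a B node expects about 3.6 inside against roughly 3.8 outside, which is consistent with B nodes departing from the homophily assumption.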

6.3  Methodology 

To  estimate  Type  I  error,  we  need  two  models  with  equal  performance  on  partially  labeled 
networks  of  the  same  size,  drawn  from  the  same  domain.  To  achieve  this,  we  measured  the 
average  accuracy  of  the  wvRN  and  the  nBC  models  on  the  synthetic  data  and  handicapped 
the  better  model  (wvRN)  until  the  performance  difference  of  the  models  was  <0.005.  The 
better performing model was handicapped by randomly selecting p% of its predictions and
perturbing  those  probabilities  toward  the  opposite  class.  To  set  p,  we  generated  50  networks 
for  use  as  a  calibration  set.  Each  of  the  50  networks  was  sampled  into  10-fold  network 
cross-validation  sets,  resulting  in  500  training/test  set  splits  on  which  we  measured  average 
accuracy  of  each  model.  Using  this  calibration  set,  we  searched  for  a  value  of  p  that  resulted 
in  a  performance  difference  of  <0.005  between  the  two  models. 
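
The handicapping step can be sketched as follows. The exact perturbation is not spelled out in the text, so this sketch makes a loud assumption: a selected prediction is reflected around 0.5 (a probability q becomes 1 - q), which always moves it toward the opposite class. The function names, the 0.5 decision threshold, and the toy predictions are hypothetical.

```python
import random


def perturb_predictions(probs, p, rng):
    """Perturb a random fraction p of predicted P(+) values toward the opposite class."""
    perturbed = list(probs)
    num_selected = round(p * len(probs))
    for idx in rng.sample(range(len(probs)), num_selected):
        perturbed[idx] = 1.0 - perturbed[idx]  # assumption: reflect around 0.5
    return perturbed


def accuracy(probs, labels):
    """Fraction of instances whose thresholded prediction matches the 0/1 label."""
    return sum(int(q >= 0.5) == y for q, y in zip(probs, labels)) / len(labels)


# Toy predictions from a model that is right on every instance.
probs = [0.9] * 80 + [0.1] * 20
labels = [1] * 80 + [0] * 20
handicapped = perturb_predictions(probs, p=0.1, rng=random.Random(0))
acc_before = accuracy(probs, labels)
acc_after = accuracy(handicapped, labels)
```

A calibration search would repeat this for a range of p values and keep the one for which the accuracy gap between the two models on the calibration splits falls below the 0.005 threshold.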

To  estimate  power,  we  need  to  vary  the  performance  difference  between  the  two  models.  To 
achieve  this,  we  perturbed  the  predictions  of  the  worse  performing  model  (nBC)  to  increase 
the  mean  difference  in  performance  between  the  two  models.  For  the  power  experiments,  we 
used  perturbation  rates  of  p  =  [0.025,  0.075,  0.15,  0.3]. 

6.4  Results 

To  measure  Type  I  error  rates  and  power  of  the  statistical  tests,  we  used  four  synthetic 
networks  (in  addition  to  the  calibration  set).  On  each  network,  we  considered  four  levels  of 
labeling: [0.1, 0.2, 0.3, 0.4]. At each level of labeling, we sampled the network 10 times,
either by repeated sampling or by cross-validation. On each of the ten samples, we learned
the nBC model on the labeled portion of the network, and then we applied both models to
the unlabeled portion of the network with the perturbation rate p. We measured the accuracy
of each model on the ten test sets and then assessed the difference in performance using a
t-test.
Fig. 3 Type I error rates for synthetic data and real classifiers, as the proportion of labeled
data increases (x-axis: Proportion Labeled)


Figure 3 plots the Type I error rates for four combinations of sampling method (RS, NCV)
and statistical test (paired and unpaired t-test). Type I error rates for each dataset are
measured over 100 trials and averaged. Recall that we use the calibration set to choose a value
for  p  that  makes  the  average  performance  of  the  wvRN  and  nBC  equal  at  the  given  level  of 
labeling.  Thus,  any  trial  in  which  the  t-test  assesses  the  observed  performance  difference  as 
significant  corresponds  to  a  Type  I  error. 

All  the  tests  have  high  Type  I  error  rates  with  10%  of  instances  labeled.  This  error  generally 
decreases  as  the  amount  of  labeled  data  increases  (and  thus  the  size  of  the  test  set  decreases). 
Since we are using relatively simple classifiers, as the amount of labeled data increases, model
performance does indeed converge, and the two models make similar classification errors at
labeling  rates  greater  than  40%.  In  reality,  when  comparing  relational  models  with  different 
representations  and  different  complexity,  we  expect  Type  I  error  to  occur  at  all  levels  of 
labeling. 

Notably, the repeated sampling approach produces Type I error rates as high as 50% (at 20%
labeling). This means that half the time the method concludes that there is a significant
performance difference between the two models, when in fact there is none. For data mining and
machine learning researchers who are investigating the trade-offs between learning algorithms,
this is an unacceptable level of error. On the same data, the error rate of the network
cross-validation procedure is only 15% with the paired t-test, a 70% reduction in Type I error.
Clearly, the network cross-validation approach results in a more accurate comparison of model
performance.

Note  that  in  the  simulated  classifier  experiments  (see  Fig.  2a, b),  NCV  across  the  propor¬ 
tions  labeled  is  equivalent  to  10-fold  CV  at  90%  labeled,  since  performance  in  the  simulated 
classifier  experiments  does  not  depend  on:  (1)  the  number  of  labeled  neighbors  available 
during  inference  and  (2)  the  number  of  instances  available  during  training.  These  additional 
dependencies  contribute  to  the  higher  Type  I  error  observed  for  NCV  with  real  classifiers. 

Fig. 4 Statistical power on synthetic data as the performance difference between classifiers
increases. Results are shown with 30% of instances labeled. (a) 300 node network. (b) 600 node
network.

Although the Type I error of NCV combined with a paired t-test is much lower than that of
resampling, it is still higher than the expected level of 5%. To investigate this behavior, we
examined the estimates used in the t-test calculation: (1) the estimate of the mean performance
difference between the two models: $\mu_{diff} = \mu_{A_{acc}} - \mu_{B_{acc}}$ and (2) the estimate
of the variance of the differences: $\mathrm{Var}_{diff} = \mathrm{Var}(\{A_{acc} - B_{acc}\}_i)$. The error correlation
increases the variance of the estimated $\mu_{diff}$, but the estimates are not biased. On the other
hand, the error correlation decreases the variance of the estimated differences, resulting in
a biased underestimate of $\mathrm{Var}_{diff}$. We considered the unpaired t-test to adjust for this bias.
The unpaired t-test uses a pooled estimate of variance over the performance measurements
$\{A_{acc}\}$ and $\{B_{acc}\}$, instead of an estimate of variance of the differences. The pooled estimate
of variance is higher than the variance of the differences, so it can offset the bias in variance
estimation due to error correlation. This is indeed the case: combining NCV with unpaired
t-tests results in Type I error rates of less than 0.05 (when proportion labeled is greater than
10%).
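
The contrast between the two variance estimates can be seen on a small example. The sketch below computes paired and unpaired t-statistics for two sets of fold accuracies; the accuracy values are made up for illustration (they are not results from the experiments), the function names are hypothetical, and the critical values 2.262 (9 degrees of freedom) and 2.101 (18 degrees of freedom) are the standard two-sided 5% thresholds for 10 folds.

```python
import math
import statistics


def paired_t(a, b):
    """Paired t-statistic: uses the variance of the per-fold differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / math.sqrt(statistics.variance(diffs) / len(diffs))


def unpaired_t(a, b):
    """Unpaired t-statistic with pooled variance (equal numbers of folds assumed)."""
    k = len(a)
    pooled = (statistics.variance(a) + statistics.variance(b)) / 2
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled * 2 / k)


# Hypothetical fold accuracies that rise and fall together across folds.
acc_a = [0.80, 0.82, 0.78, 0.85, 0.79, 0.81, 0.83, 0.77, 0.84, 0.80]
acc_b = [0.78, 0.81, 0.76, 0.84, 0.77, 0.80, 0.81, 0.76, 0.82, 0.79]

t_paired = paired_t(acc_a, acc_b)
t_unpaired = unpaired_t(acc_a, acc_b)
reject_paired = abs(t_paired) > 2.262
reject_unpaired = abs(t_unpaired) > 2.101
```

Because the two models' accuracies move together, the variance of the differences is much smaller than the pooled variance, so on this toy data the paired test rejects the null hypothesis while the unpaired test does not.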

Figure 4a plots the statistical power of each test as we varied the average performance
difference between wvRN and nBC. More specifically, we used p = [0.025,
0.075,  0.15,  0.3]  and  measured  the  average  performance  difference  between  the  two  models 
on  the  calibration  set.  This  gives  us  the  mean  performance  difference  that  is  plotted  on  the 
x-axis.  Then,  for  each  of  the  four  evaluation  networks,  we  sampled  the  network  10  times  and 
learned/applied/evaluated  the  models  as  described  above.  Again,  the  results  are  measured 
over  100  trials.  Since  the  two  models  do  perform  differently,  any  trial  in  which  the  t-test  does 
not  conclude  that  the  observed  performance  difference  is  significant  corresponds  to  a  Type 
II error. Statistical power is defined as the proportion of trials in which the t-test correctly
concludes that the two models are different (i.e., 1 - Type II error). We only plot the results
for  30%  labeled.  The  results  for  other  levels  of  labeling  are  qualitatively  similar. 

The power results illustrate an additional challenge in evaluating the performance difference
between the models. Even when there is a 5% difference in the mean performance of the
two algorithms, it is sobering to note that repeated sampling can detect this difference less
than 80% of the time. Network cross-validation is significantly worse: the paired t-test can
detect the difference less than 30% of the time and the unpaired t-test less than 5% of the
time. This may be due to the difference in test set size used by the two approaches. Recall
that  repeated  sampling  uses  all  the  unlabeled  data  for  evaluation,  so  at  30%  labeling,  this 
corresponds  to  210  nodes.  On  the  other  hand,  network  cross-validation  uses  only  10%  of  the 
nodes  for  evaluation  (i.e.,  30  nodes)  regardless  of  the  level  of  labeling  in  the  network. 

To explore this issue, we increased the dataset size to 600 and measured the power of each
approach again. Figure 4b graphs the resulting power rates. In general, the power of any test
increases as the sample size increases. However, here, we can see that the gains
for network cross-validation are relatively larger than for repeated sampling. It is difficult
to compare across the two sets of results due to different mean performance of the models.
However, if we interpolate between the results in Fig. 4b to assess the power at 5% mean
performance difference and compare to Fig. 4a at 5%, we can see that doubling the dataset
size reduced the Type II error of repeated sampling by 10%, but the Type II error for network
cross-validation was reduced by 45% (paired t-test) and 70% (unpaired t-test), respectively.


7  Empirical  investigation  with  real-world  data 

This section describes our investigation of the characteristics of statistical tests when
comparing relational learning algorithms on real-world relational data. To confirm the behavior
we  observed  in  the  synthetic  datasets,  we  compared  the  performance  of  wvRN  and  nBC  on 
data  from  the  National  Longitudinal  Study  of  Adolescent  Health  [13]  as  well  as  data  from 
the  Internet  Movie  Database  (imdb.com). 

The  Adolescent  Health  (Add  Health)  data  consist  of  survey  information  from  144  middle 
and  high  schools,  collected  in  1994-1995.  The  survey  questions  queried  for  the  students’ 
social networks along with myriad behavioral and academic attributes. In this paper, we
consider the social networks of six schools with similar autocorrelation and link patterns. The
classification  task  is  to  predict  whether  the  student  smokes  based  on  the  behavior  of  their 
friends  in  the  social  network.  The  six  schools  we  selected  have  sizes  ranging  from  300-700 
nodes,  average  degree  of  7-8,  autocorrelation  in  the  range  [0.25,  0.35],  and  average  error 
correlation  of  0.14. 

Our  second  dataset  is  drawn  from  the  Internet  Movie  Database  (IMDb).  We  collected  a 
sample  of  1,543  movies  released  in  the  United  States  between  2003  and  2007,  with  their 
associated  producers  and  studios.  To  create  six  disjoint  network  samples,  we  partitioned  the 
movies by their associated studios, using stratified sampling by studio size. Within each
partition, we created links among movies with a common producer, ignoring any links across
the partitions. The resulting networks have an average size of 257 nodes, and the movies have
an average degree of 16. The classification task is to predict whether the movie will make more
than $60 million (inflation adjusted) in total box office receipts. The average autocorrelation in
these  networks  is  0.35,  and  the  average  error  correlation  is  0.22. 

To assess the Type I error characteristics of the models, we used a procedure similar to
the one described in Sect. 6. Each trial considers one network as the evaluation set; we then
calibrate the models on the remaining five networks under the assumption that these networks
were drawn from the same distribution. We sampled each of these five networks 10 times
into 10-fold NCV sets, producing a calibration set of 500 training/test splits. As described
previously, we searched for a value of p that resulted in a performance difference of <0.005
between the two models. We considered five levels of labeling: [0.1, 0.2, 0.3, 0.4, 0.5],
calibrated the models at each level of labeling, and measured the Type I error on the held-out
network. The models converge in performance at 50-60% labeling (i.e., p = 0), so we do not
consider larger labeling rates.

Figure 5 shows Type I error for each combination of sampling method and statistical
test, measured over 50 trials. Figure 5a and b show the results for the Add Health and IMDb
datasets, respectively.

As expected, the statistical tests exhibit similar behavior on the real relational data and the
synthetic data. Again, resampling produces unacceptable levels of Type I error (up to 40%)
and network cross-validation has more reasonable error rates. Overall, Type I error decreases
as the proportion of labeled data increases to 50-60% (i.e., test set size decreases). Recall,
however, that we are investigating simple models that are nearly equivalent on a restricted task
involving only the class label and no other attribute/link features. In practice, as the
complexity of models and concepts increases, Type I errors are likely to occur at all levels of
labeling.

Fig. 5 Type I error rates on real relational datasets, as the proportion of labeled data
increases. (a) Add Health. (b) IMDb. (x-axis: Proportion Labeled)


8  Conclusions  and  discussion 

In this paper, we examined the characteristics of statistical tests for comparing within-network
classification algorithms. We presented three resampling procedures and three significance
tests, and performed experiments on both real and synthetic data using real and simulated
classifiers.

Our  analysis  shows  that  a  commonly  used  form  of  evaluation  in  relational  learning  (paired 
t-tests  on  overlapping  network  samples)  can  result  in  unacceptably  high  levels  of  Type  I  error 
(as  high  as  50%).  High  Type  I  error  indicates  that  many  algorithm  differences  will  be  judged 
incorrectly  as  significant  when  in  fact  performance  is  equivalent.  Although  for  efficiency 
reasons  we  considered  relatively  simple  relational  models  for  this  work,  our  findings  apply 
to  evaluations  of  more  complex  relational  models  as  well — since  any  relational  model  that 
attempts  to  exploit  relational  autocorrelation  is  likely  to  produce  correlated  errors. 

Furthermore,  we  demonstrated,  both  theoretically  and  empirically,  that  Type  I  error 
increases as (1) the correlation among instances increases and (2) the overlap among the
evaluation sets increases (i.e., the proportion of labeled nodes in the network decreases).

Although we investigated the properties of significance tests for within-network
classification, the findings are also applicable to across-network tasks and other forms of
hypothesis testing (e.g., standard error bars will be underestimated). The extent of the effect will depend
on  the  level  of  observed  autocorrelation  (which  will  cause  error  correlation),  as  well  as  the 
amount  of  overlap  between  samples. 

We proposed a method for network cross-validation that reduces the overlap between
samples. We note that although the method creates disjoint test sets, the predictions for those
test set instances will be influenced by other predictions in the unlabeled inference set (due
to the collective inference process). This means that there is still some dependency between
test sets (since the inference sets overlap) which could increase the Type I error of NCV.
Our empirical evaluation shows that NCV combined with unpaired t-tests results in low
levels of Type I error. However, this low error is achieved at the expense of statistical power
(i.e., higher Type II error). NCV combined with paired t-tests produces more acceptable levels
of Type I error while still providing reasonable levels of statistical power.

Promising  future  directions  for  this  work  include:  (1)  using  patterns  (e.g.,  communities) 
in relational data to split train/test data (e.g., stratified by community, or biased by
community); (2) investigating non-random labeling patterns and their impact on error correlation for
different  collective  inference  methods;  and  (3)  investigating  how  characteristics  of  relational 
data  affect  the  power  of  statistical  tests  (i.e.,  Type  II  error). 